Automated translation of subject matter specific documents

ABSTRACT

Documents in source natural languages are translated into target natural languages using a computer-implemented translation that is configured to operate within the domain of the subject matter of the documents that imposes specialized requirements for translation and readability. Subject matter specific documents typically include domain-specific terminology, are subject to various regulatory guidelines, and have different readability requirements depending on the intended reader. The computer-implemented translation applies machine-learning techniques that deconstruct elements of the subject matter specific document into a standard data structure and perform pre-processing steps to tokenize digitized document text to identify the correct sentence structure and syntax for the target natural language to optimize translation by, e.g., a neural machine translation engine. The text segments that are input into the neural machine translation engine are generated to be semantically meaningful in the target natural language to thereby enhance the understanding of the neural machine translation engine.

CROSS-REFERENCE TO RELATED APPLICATION

This application is a continuation of U.S. Ser. No. 17/098,812 filed on Nov. 16, 2020, now U.S. Pat. No. 11,734,514, which is a continuation of U.S. Ser. No. 16/276,002 filed on Feb. 14, 2019, now U.S. Pat. No. 10,839,164, which claims the benefit of U.S. Provisional Application Ser. No. 62/739,541 filed on Oct. 1, 2018, the entire content of which is hereby incorporated by reference in its entirety.

BACKGROUND

Language translation involves the conversion of sentences from one natural language (i.e., a language that has developed naturally through use, as contrasted with artificial language or computer code), usually referred to as the “source” language, into another language, typically called the “target” language. When performed by a machine (e.g., a computer) such translation is referred to as automated language translation or machine translation.

SUMMARY

Documents in a source natural language are translated into one or more target natural languages using a computer-implemented translation tool that is configured to operate within the domain of the subject matter that imposes specialized requirements for translation and readability. Subject matter specific documents typically include domain-specific terminology, are subject to various regulatory guidelines, and have different readability requirements depending on the intended reader (for example, doctor vs. patient, adult vs. child). The computer-implemented translation tool applies machine-learning techniques that deconstruct elements of a document into a standardized data structure and perform pre-processing steps to parse digitized document text to identify the correct sentence structures for the target natural language to optimize translation by a translation engine such as a neural machine translation engine. The tokens that are input into the neural machine translation engine are generated to be semantically meaningful in the target natural language to thereby enhance the understanding of the neural machine translation engine.

Tokens are transmitted over an application programming interface (API) to the neural machine translation engine in a specific order. Machine learning techniques are applied to post process the translated tokens returned from the neural machine translation engine to correct ontology in the semantic domains of subject matter specific terminology. The translation tool reconstructs the document in the target natural language using the ordered translated tokens with corrected ontology to maintain characteristics of the original document including, for example, format, layout, images (e.g., pictures, photographs, illustrations, etc.), and other content (e.g., diagrams, tables, graphs, charts, etc.). The pre-processing techniques (prior to the machine translation) and post-processing techniques (after machine translation) vary based on the characteristics of the language and its complexity for machine translation (e.g., word order can be relatively free because of the morphology of the German language, Russian is a highly inflected language, Japanese has different writing systems). The machine translated text is subject to adjustments from a human operator through a user interface that is exposed on the translation tool. The adjustments can be used to improve the translation of a specific document and may be used as a machine learning input to improve performance of the translation tool in general.

In various illustrative examples, the pre-processing of the document comprises sentence splitting and simplification, named entity recognition, fast fuzzy matching with existing translated documents in a translation memory database, and application of transformational grammar. A cascade of finite-state transducers is configured to perform sentence splitting that feeds a speech tagger (described below). The sentence splitting and simplification processing generates ordered tokens that are semantically meaningful. The structure of the processed sentences is typically less complex than the original text.

The named entity recognition processing identifies text that may be excluded from translation or which requires translation utilizing a user-defined glossary. Named entities can include proper nouns (e.g., company names, family names, city names), abbreviations, and acronyms that are matched against data stored in a table or database. The process can include classification of named entities into pre-defined classes and extraction of recognized information, and confidential information can be masked.

The named entity recognition processing can further enhance opportunities for matching tokens to existing translated documents in translation memory. Use of the translation memory enables some portions of the original text to be translated based on historical documents prior to transmission to the neural machine translation engine over the API. The matching can be implemented using fuzzy logic in which matches between document text and the translation memory can be less than 100 percent.

A speech tagger is utilized to implement the transformational grammar processing which identifies parts of speech in the document text while providing a single representation of sentences that have a common meaning using a series of transformations. The transformations can include detecting a passive voice sentence and transforming the detected passive voice sentence into an active voice sentence. An indirect sentence form may be detected and transformed into a direct sentence form. A transformation in which words in a sentence are re-ordered based on sentence structure requirements of the target natural language may also be implemented.

The present computer-implemented translation tool provides improvements in the underlying operation of the computing device on which it executes by providing for increased translation accuracy. More specifically, the utilization of the pre-processing enables efficient utilization of processing cycles, memory requirements, and network bandwidth by creating input to the neural machine translation engine that results in accurate output and reduces the need to redo translations or discard poor results. The translation tool further enhances the efficiency of the human-machine interface on the computing device because the tool produces a more complete and accurate translation compared with conventional methodologies. The translation tool produces translated documents quickly with a high degree of correctness including grammar that is target language-appropriate with the proper utilization of specialized and domain-specific phrases and terms. A human translator's interaction with the translation tool can thus be focused on adjusting and refining the new translated document produced by the tool to leverage the translator's time and language expertise to optimal advantage.

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure. It will be appreciated that the above-described subject matter may be implemented as a computer-controlled apparatus, a computer process, a computing system, or as an article of manufacture such as one or more computer-readable storage media. These and various other features will be apparent from a reading of the following Detailed Description and a review of the associated drawings. The term clinical trial will be used throughout the Detailed Description as an example or substitute for any subject matter specific document.

DESCRIPTION OF THE DRAWINGS

FIG. 1 shows an illustrative computing environment in which a computer-implemented translation tool executes on a computing device;

FIG. 2 shows illustrative details of the computer-implemented translation tool;

FIG. 3 shows illustrative details of natural language structuring pre-processing that is utilized in the computer-implemented translation tool;

FIG. 4 shows an illustrative arrangement in which tokens are provided as an ordered input to a neural machine translation engine;

FIG. 5 shows an illustrative finite-state transducer cascade that is used to implement sentence splitting and simplification;

FIG. 6 shows an illustrative named entity recognition system;

FIG. 7 shows an illustrative text matching system that is used to implement fast fuzzy matching with a translation memory;

FIG. 8 shows an illustrative transformation system;

FIG. 9 shows illustrative manual adjustments to a translated clinical trial document that may be utilized as machine learning input;

FIGS. 10, 11, and 12 show illustrative methods for automated translation of clinical trial documents;

FIG. 13 is a block diagram of an illustrative computing device that may be used at least in part to implement the present automated translation of clinical trial documents; and

FIG. 14 is a simplified block diagram of an illustrative computing device that may be used at least in part to implement the present automated translation of clinical trial documents.

Like reference numerals indicate like elements in the drawings. Elements are not drawn to scale unless otherwise indicated.

DETAILED DESCRIPTION

FIG. 1 shows an illustrative computing environment 100 in which a human operator or translator 105 employs a computing device 110 that is configured to communicate over a communications network 115 with a remote service provider 120 that supports a neural machine translation engine 125. In alternative implementations, the translation engine may be implemented using a statistical model or a combination of statistical and neural modeling. In addition, the neural machine translation engine may be supported locally, for example, by an entity or enterprise that supports the computing device and operator or using a combination of local and remote support.

The computing device 110 hosts a computer-implemented translation tool 130 that may be implemented, for example, as a software application that executes on the device. In alternative implementations, the translation tool may be implemented using hardware, firmware, or a combination thereof, depending on the needs of a particular implementation of the present automated translation of clinical trial documents.

In this illustrative example, the computer-implemented translation tool communicates over the network 115 through an application programming interface (API) 135 with the neural machine translation engine 125. As described in more detail below, the translation tool sends tokens 140 that are expressed in a source natural language over the API 135 to the neural machine translation engine and receives tokens 145 that are expressed in a target natural language that is different from the source. Thus, the neural machine translation engine translates a token from one language (i.e., the source language) to another (i.e., the target language). While this illustrative example uses a combination of processing at the local computing device (as indicated by reference numeral 150) and processing by the remote service provider 120 (as indicated by reference numeral 155) to provide a complete solution for automated translation of clinical trial documents, it is noted that other processing allocations and arrangements may also be utilized. For example, the translation tool may be instantiated as a remote or cloud-based application. Various combinations of local and remote processing can be implemented as appropriate for a given translation tool implementation.

The computing device 110 comprises an electronic device such as a personal computer, server, handheld device, workstation, multimedia console, smartphone, tablet computer, laptop computer, or the like. In the discussion that follows, the use of the term “computing device” is intended to cover all electronic devices that perform some computing operations, whether they be implemented locally, remotely, or by a combination of local and remote operation.

The communications network 115 can include any of a variety of network types and network infrastructure in various combinations or sub-combinations including local-area networks (LANs), wide-area networks (WANs), cellular networks, satellite networks, IP (Internet-Protocol) networks such as Wi-Fi under IEEE 802.11 and Ethernet networks under IEEE 802.3, a public switched telephone network (PSTN), and/or short-range networks such as Bluetooth® networks. Network infrastructure can be supported, for example, by mobile operators, enterprises, Internet service providers (ISPs), telephone service providers, data service providers, and the like. The communications network 115 may utilize portions of the Internet (not shown) or include interfaces that support a connection to the Internet so that the computing device 110 can access data or content and/or render user experiences supported by the remote service provider and/or other service providers (not shown).

FIG. 2 shows illustrative details of the computer-implemented translation tool 130 that may be utilized to process one or more clinical trial documents 205 as inputs that are expressed in a given source natural language. The translation tool implements multiple stages of processing including document deconstruction 210, natural language structuring pre-processing 215, neural machine translation 220, semantic ontology correction post-processing 225, and document reconstruction 230. As shown, the output of the translation tool is a corresponding one or more translated clinical trial documents 235 that are expressed in a given target natural language. Neural machine translations are currently available for many language pairs and the number is expected to increase. While English is commonly one of the languages in a pair, non-English language pairs are also expected to become more widely supported.

Document deconstruction 210 includes converting the source clinical trial documents 205 to a digitized form that uses a standardized data structure across all documents. The quality of the source materials may be expected to vary widely in typical implementations. Thus, the document deconstruction stage can apply various techniques to accommodate noise and unwanted artifacts during digitization to improve quality of the input to the translation tool 130. In some cases, relevant descriptive information such as metadata can be collected for the input clinical trial documents and stored. Such information may be used, for example, for clinical trial document management and other purposes.

The natural language structuring pre-processing stage 215 provides tokenization of the digitized clinical trial documents 205 to provide for optimized neural machine translation. The pre-processing stage is described in more detail in the description below that accompanies FIGS. 3-8. The neural machine translation stage 220, as noted above, may be supported by interactions with the neural machine translation engine 125 supported by the remote service provider 120, as indicated by the dashed line 240. The neural machine translation stage 220 translates the tokens provided by the pre-processing stage and returns the translated tokens to the semantic ontology correction post-processing stage 225. In post-processing, the individual translated tokens from the neural machine translation are corrected to account for phrases, terms, acronyms, and other domain-specific language that is used in the clinical trial or medical domains. The document reconstruction stage 230 operates to maintain the formatting of the original source document in the translated output documents 235 in the target language. The document reconstruction stage can also be configured to persist other characteristics across the documents (i.e., from source input to target output) including images (e.g., pictures, photographs, illustrations, etc.), and other content (e.g., diagrams, tables, graphs, charts, etc.).
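As a minimal sketch of the post-processing correction described above, the following Python assumes the domain-specific language is held in a simple glossary that maps a generic rendering returned by the translation engine to the preferred clinical term. The glossary entries and the function name `correct_ontology` are hypothetical and are used only for illustration; the disclosure does not specify how the correction is implemented.

```python
import re

# Hypothetical glossary: generic engine rendering -> preferred clinical term.
CLINICAL_GLOSSARY = {
    "bad event": "adverse event",
    "informed agreement": "informed consent",
}


def correct_ontology(translated_tokens, glossary=CLINICAL_GLOSSARY):
    """Apply glossary corrections to each translated token, preserving token order."""
    corrected = []
    for token in translated_tokens:
        for generic, preferred in glossary.items():
            # Whole-phrase, case-insensitive match so partial words are not altered.
            pattern = re.compile(r"\b" + re.escape(generic) + r"\b", re.IGNORECASE)
            token = pattern.sub(preferred, token)
        corrected.append(token)
    return corrected


if __name__ == "__main__":
    tokens = ["Each bad event was recorded.", "Written informed agreement was obtained."]
    for line in correct_ontology(tokens):
        print(line)
```

Because each token is corrected independently and appended in sequence, the corrected output keeps the order in which the translated tokens were received.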

FIG. 3 shows illustrative details of the natural language structuring pre-processing stage 215 that is utilized in the computer-implemented translation tool. The inputs to the pre-processing stage include a digitized clinical trial document 305 that is expressed in a source natural language. The pre-processing stage includes four constituent elements including sentence splitting and simplification 310, named entity recognition 315, fast fuzzy matching 320, and transformational grammar 325. In typical implementations the processing is performed sequentially with sentence splitting performed first, then named entity recognition and fast fuzzy matching, and followed by transformational grammar. However, in alternative implementations, the processing may be performed in parallel, in a combination of series and parallel, in another sequence, or in various combinations thereof. Generally, sentence splitting and simplification is the first processing that is performed which feeds the other constituent elements in the pre-processing stage. The output of the natural language structuring pre-processing stage 215 includes tokens 330 in the source natural language that are arranged as an ordered output 335.

FIG. 4 shows that the ordered output 335 from the natural language structuring pre-processing stage is utilized so that ordered tokens 405 are provided as an input to the neural machine translation engine 125 via the API 135. That is, the output from the natural language structuring pre-processing stage is provided token-by-token to preserve the order. The neural machine translation engine translates the tokens from source to target language and preserves the order when supplying the ordered translated tokens 410 to the semantic ontology correction post-processing stage 225.
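The token-by-token, order-preserving hand-off can be sketched as follows. This is an illustration only: the `translate` callable stands in for whatever call is made over the API 135, and `stub_engine` is a hypothetical placeholder that does no real translation. Each token carries an explicit sequence index so that the original order can be restored even if results were collected out of sequence.

```python
from typing import Callable, List, Tuple


def translate_in_order(tokens: List[str], translate: Callable[[str], str]) -> List[Tuple[int, str]]:
    """Send tokens one at a time and return (sequence index, translated token) pairs in order."""
    results = []
    for index, token in enumerate(tokens):
        results.append((index, translate(token)))
    # Sorting by the sequence index is a no-op here, but it restores the order
    # if a different collection strategy ever returns results out of sequence.
    return sorted(results, key=lambda pair: pair[0])


def stub_engine(text: str) -> str:
    """Hypothetical stand-in for the remote engine; it only tags the text."""
    return "[target] " + text


if __name__ == "__main__":
    ordered = translate_in_order(["First segment.", "Second segment."], stub_engine)
    for index, translated in ordered:
        print(index, translated)
```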

FIG. 5 shows an illustrative finite-state transducer (FST) cascade 505 that may be used to implement sentence splitting and simplification element 310 of the natural language structuring pre-processing stage 215 (FIG. 2). FST cascades are well adapted to represent the tokenization paths in the linguistic patterns that may be commonly encountered in clinical trial documents. Sentence splitting is performed before subsequent text processing is performed to break up the input text into distinct and meaningful units. Splitting would be straightforward to accomplish if the source natural language is perfectly punctuated. However, even when expressed in well punctuated languages, the source clinical trial document text 510 will typically include ambiguities that will result in multiple tokenization options for dividing one meaningful unit from an adjacent meaningful unit. To improve tokenization performance, the FST cascade 505 uses a gazetteer list of abbreviations 520 to help distinguish sentence-marking full stops from other markings.

The FST cascade 505 provides tokens 515 that comprise text segments that have reduced complexity and length compared with the source text. The tokens identify key sentence structures that can improve translation performance by the neural machine translation engine 125 (FIG. 1) by being semantically meaningful. In addition, as noted above, the FST cascade provides the tokens as an ordered output for translation by the neural machine translation engine.
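The role of the abbreviation gazetteer in sentence splitting can be illustrated with a short Python sketch. It is not a finite-state transducer cascade; it is a plain word-by-word scan that shows only how a list of known abbreviations keeps a full stop after "Dr." or "no." from being treated as a sentence boundary. The gazetteer contents and the function name `split_sentences` are assumptions made for the example.

```python
# Hypothetical gazetteer of abbreviations whose trailing full stop does not end a sentence.
ABBREVIATIONS = {"dr.", "e.g.", "i.e.", "no.", "vs.", "fig."}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split text into ordered sentences, ignoring full stops that belong to abbreviations."""
    sentences = []
    current = []
    for word in text.split():
        current.append(word)
        ends_with_stop = word.endswith((".", "?", "!"))
        if ends_with_stop and word.lower() not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep trailing text that has no closing punctuation.
        sentences.append(" ".join(current))
    return sentences


if __name__ == "__main__":
    sample = "Dr. Smith enrolled 12 patients. Dosing followed protocol no. 7 exactly."
    for sentence in split_sentences(sample):
        print(sentence)
```

The sentences are returned in document order, which matches the ordered output that the pre-processing stage is described as producing.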

FIG. 6 shows an illustrative named entity recognition system 605 that may be used to implement the named entity recognition element 315 of the natural language structuring pre-processing stage 215 (FIG. 2). Named entities may create problems for machine translation systems and can cause translation failures that impact overall morphosyntactic well-formedness of sentences and word sense disambiguation in the source clinical trial document text. The named entity recognition system 605 employs methodologies that implement different approaches to translation of named entities compared with other types of words. For example, foreign person names in Russian should be transcribed and written in Cyrillic, and names that coincide with common nouns should not be looked up in the general dictionary.

The named entity recognition system 605 is configured to compare clinical trial document text 610 against entries in a named entity table or database 615. The system can use the results in various ways such as excluding named entities from translation 620, for example, names of organizations. Such selective translation exclusion may help to maximize opportunities to match document text with translation memory, as described below. Recognized information, such as confidential information or personally identifiable information, can be masked 625. Recognized information can also be extracted 630 from the source document and used for various purposes. In some cases, information that is excluded from translation by the neural machine translation engine 125 (FIG. 1) can be translated using a user-defined glossary.
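A minimal sketch of the table lookup, exclusion, and masking behavior is shown below. It assumes the named entity table is a small in-memory dictionary that maps an entity string to a class label, and that entities are shielded from translation by swapping them for placeholders that are restored afterward. The table contents, the placeholder format, and the function names are hypothetical.

```python
# Hypothetical named entity table: entity text -> pre-defined class.
NAMED_ENTITY_TABLE = {
    "Acme Pharma": "ORGANIZATION",
    "Jane Doe": "PERSON",
}

# Hypothetical policy: classes treated as confidential and therefore masked.
MASKED_CLASSES = {"PERSON"}


def recognize_entities(text, table=NAMED_ENTITY_TABLE, masked=MASKED_CLASSES):
    """Replace recognized entities with placeholders so they are excluded from translation."""
    protected = {}
    for index, (entity, label) in enumerate(table.items()):
        if entity not in text:
            continue
        placeholder = "__ENT{}__".format(index)
        text = text.replace(entity, placeholder)
        # Masked classes are restored as a redaction marker instead of the original text.
        protected[placeholder] = "[REDACTED]" if label in masked else entity
    return text, protected


def restore_entities(text, protected):
    """Put the original (or masked) entity text back after translation."""
    for placeholder, value in protected.items():
        text = text.replace(placeholder, value)
    return text


if __name__ == "__main__":
    shielded, protected = recognize_entities("Acme Pharma enrolled Jane Doe.")
    print(shielded)
    print(restore_entities(shielded, protected))
```

Replacing entities with stable placeholders also makes otherwise different sentences look alike, which is one way the exclusion could increase the chance of a translation memory match.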

FIG. 7 shows an illustrative text matching system 705 that may be used to implement the fast fuzzy matching 320 element of the natural language structuring pre-processing stage 215 (FIG. 2) in which clinical trial document text 710 is compared against entries in a translation memory 720. The translation memory includes existing translated clinical trial documents that can be matched against source text to thereby generate translated text 715 using a separate process from the neural machine translation engine 125 (FIG. 1). The matching can be implemented using fuzzy logic in which matches between document text and the translation memory can be less than 100 percent. The text matching system can provide an expression 725 of the fuzzy match in some implementations. Translation memory matches are expressed as percentages in which a perfect match is a 100% match, and fuzzy matches are less than 100% matches.

The translation memory 720 can be optimized by processing existing translated clinical trial documents to remove incorrect or confusing language conversions that do not make sense. Such optimization can improve matching effectiveness and increase document translation accuracy. The text matching system 705 can be implemented using fast search algorithms that enable performant matching by improving the retrieval of salient information from the translation memory which can be large.
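The percentage-based matching can be sketched with the `SequenceMatcher` class from the Python standard library `difflib` module. This is an assumption made for illustration: the disclosure does not name a similarity measure, and the linear scan below is not one of the fast search algorithms mentioned above. The translation memory contents, the 75 percent threshold, and the function name `best_match` are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical translation memory: source segment -> existing translation.
TRANSLATION_MEMORY = {
    "The subject signed the consent form.": "Der Proband unterschrieb das Formular.",
    "Store the product at room temperature.": "Das Produkt bei Raumtemperatur lagern.",
}


def best_match(segment, memory=TRANSLATION_MEMORY, threshold=75):
    """Return (translation, match percentage) for the closest entry, or None below threshold."""
    best_source = None
    best_score = 0.0
    for source in memory:
        # ratio() is 1.0 for identical strings and lower for partial matches.
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if score > best_score:
            best_source, best_score = source, score
    percent = round(best_score * 100)
    if best_source is None or percent < threshold:
        return None
    return memory[best_source], percent


if __name__ == "__main__":
    print(best_match("The subject signed the consent form."))
    print(best_match("The subject signed the consent forms."))
    print(best_match("xyz"))
```

A segment that returns `None` would be left for the neural machine translation engine, while a match at or above the threshold could be reused together with its percentage as the expression of the fuzzy match.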

FIG. 8 shows an illustrative transformation system 805 that may be used to implement the transformational grammar element 325 in the natural language structuring pre-processing stage 215 (FIG. 2). The transformation system is configured to transform clinical trial document text 810 into a single representation for text that has the same meaning 815 using a series of transformations. The transformation system 805 also exposes a tagger 835 that is configured to identify and tag information for parts of speech of the clinical trial document text such as verb, noun, adjective, etc.

As shown, the transformations include a passive voice sentence transformation 820 in which passive voice sentences are detected and transformed into active voice sentences. An indirect sentence transformation 825 detects indirect sentences and transforms them into direct sentences. A word re-ordering transformation 830 re-orders words in the source document text according to language structures that are appropriate for the target language, for example, to accommodate the more formalized layout of the sentence in German as compared to Spanish.
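The passive voice transformation can be illustrated with a deliberately narrow Python sketch. It assumes a single surface pattern, "OBJECT was/were VERB-ed by AGENT.", and uses a regular expression in place of the part-of-speech tagger, so it handles only regular past participles and would wrongly lowercase a proper noun that opens the sentence. The pattern and the function name `to_active` are hypothetical and are not the method of the disclosure.

```python
import re

# Matches only the simple pattern "<object> was|were <verb>ed by <agent>."
PASSIVE_PATTERN = re.compile(
    r"^(?P<obj>.+?) (?:was|were) (?P<verb>\w+ed) by (?P<agent>.+?)\.$"
)


def to_active(sentence):
    """Rewrite a simple passive sentence in active voice; return other sentences unchanged."""
    match = PASSIVE_PATTERN.match(sentence)
    if match is None:
        return sentence
    agent = match.group("agent")
    obj = match.group("obj")
    verb = match.group("verb")
    # The agent moves to the front and is capitalized; the former subject is lowercased.
    return "{}{} {} {}{}.".format(agent[0].upper(), agent[1:], verb, obj[0].lower(), obj[1:])


if __name__ == "__main__":
    print(to_active("The dose was administered by the nurse."))
    print(to_active("The nurse administered the dose."))
```

Both inputs in the usage example end up as the same active voice sentence, which is the sense in which the transformations yield a single representation for sentences that share a meaning.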

FIG. 9 shows illustrative manual adjustments 905 to a translated clinical trial document 910 that may be utilized as machine learning input 915 to the computer-implemented translation tool 130. In this example, the human translator can perform a comparison 920 between the translated clinical trial document in the target natural language and the original clinical trial document 925 in the source natural language. The translator may make adjustments to the document that the translation tool can analyze to make appropriate changes in the underlying automated translation processes. Alternatively, the translator may directly adjust the processes themselves to achieve a desired outcome. In some cases, the translator may perform multiple translation iterations to assist the machine-learning process by specifying different translation outcomes, or varying processing parameters with each iteration.

FIGS. 10, 11, and 12 show illustrative methods for automated translation of clinical trial documents. Unless specifically stated, methods or steps shown in the flowcharts and described in the accompanying text are not constrained to a particular order or sequence. In addition, some of the methods or steps thereof can occur or be performed concurrently. Not all the methods or steps have to be performed in a given implementation, depending on the requirements of such implementation, and some methods or steps may be optionally utilized.

FIG. 10 is a flowchart of an illustrative method 1000 that may be performed by a computing device that executes an automated translation tool for translating a clinical trial document from a source natural language to a target natural language. In step 1005, the device receives, as an input, an electronic representation of the clinical trial document in a source natural language in which text of the clinical trial document is digitized. In step 1010, the device splits sentences and pre-orders words in the digitized text into segments having reduced complexity relative to unsplit sentences. In step 1015, the device recognizes words in the digitized text that match entries in a named entity table. In step 1020, the device matches segments to entries in a translation memory database of existing clinical trial documents that are translated from the source natural language to at least partially translate the matched segment from the source natural language to the target natural language. In step 1025, the device applies transformational grammar to the digitized text to produce a single representation of sentences in the clinical trial documents that share a common meaning. In step 1030, the device provides, as an output, a machine-understandable representation of the source natural language text that includes semantic meaning.

FIG. 11 is a flowchart of an illustrative method 1100 that may be performed by a computing device that executes an automated translation tool for translating a clinical trial document expressed in a source natural language to a target natural language. In step 1105, the device implements a cascade of finite-state transducers to split text in the clinical trial document into segments by identifying sentence boundaries using a gazetteer list of abbreviations to identify sentence marking stops. In step 1110, the device performs named entity recognition on the text to identify text that is excluded from translation to the target natural language. In step 1115, the device searches a translation memory for fuzzy matches between segments and existing translations between the source natural language and the target natural language. In step 1120, the device implements a speech tagger configured to identify parts of speech in the text. In step 1125, the device grammatically transforms the text to provide a single representation of sentences that have a common meaning. In step 1130, the device implements an application programming interface (API) to an external translation engine over which segments are transmitted for translation by the engine. In step 1135, the device receives translated segments in the target natural language from the external translation engine. In step 1140, the device corrects the translated segments for clinical trial acronyms or medical terminology. In step 1145, the device reconstructs the clinical trial document using the corrected translated segments in the target natural language.

FIG. 12 is a flowchart of an illustrative method 1200 that may be performed by a computing device that executes an automated translation tool. In step 1205, the device deconstructs elements of one or more clinical trial documents into a standardized data structure to generate digitized text as an input to the computer-implemented translation tool. In step 1210, the device pre-processes the digitized text into tokens to identify key sentence structures to optimize neural machine translation of the clinical trial documents from a source language to a target language, the key sentence structures expressing relationships within a semantic domain of one of clinical trial or medical terminology, the pre-processing further identifying a token order for neural machine translation. In step 1215, the device provides the pre-processed digitized text to a neural machine translation engine token by token in the identified order. In step 1220, the device receives translated tokens in the target language from the neural machine translation engine. In step 1225, the device post-processes the received translated tokens to correct ontology in the semantic domain of one of clinical trial or medical terminology. In step 1230, the device reconstructs clinical trial documents using the translated tokens with corrected semantic ontology in which the reconstructed clinical trial documents in the target language maintain characteristics of the original clinical trial documents in the source language, the characteristics including one of formatting or embedded images.

FIG. 13 shows an illustrative architecture 1300 for a device, such as a server, capable of executing the various components described herein for the present automated translation of clinical trial documents. The architecture 1300 illustrated in FIG. 13 includes one or more processors 1302 (e.g., central processing unit, dedicated artificial intelligence chip, graphic processing unit, etc.), a system memory 1304, including RAM (random access memory) 1306 and ROM (read only memory) 1308, and a system bus 1310 that operatively and functionally couples the components in the architecture 1300. A basic input/output system containing the basic routines that help to transfer information between elements within the architecture 1300, such as during startup, is typically stored in the ROM 1308. The architecture 1300 further includes a mass storage device 1312 for storing software code or other computer-executed code that is utilized to implement applications, the file system, and the operating system. The mass storage device 1312 is connected to the processor 1302 through a mass storage controller (not shown) connected to the bus 1310. The mass storage device 1312 and its associated computer-readable storage media provide non-volatile storage for the architecture 1300. Although the description of computer-readable storage media contained herein refers to a mass storage device, such as a hard disk, solid state drive, or optical drive, it may be appreciated that computer-readable storage media can be any available storage media that can be accessed by the architecture 1300.

By way of example, and not limitation, computer-readable storage media may include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. For example, computer-readable media includes, but is not limited to, RAM, ROM, EPROM (erasable programmable read only memory), EEPROM (electrically erasable programmable read only memory), Flash memory or other solid state memory technology, CD-ROM, DVDs, HD-DVD (High Definition DVD), Blu-ray, or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the architecture 1300.

According to various embodiments, the architecture 1300 may operate in a networked environment using logical connections to remote computers through a network. The architecture 1300 may connect to the network through a network interface unit 1316 connected to the bus 1310. It may be appreciated that the network interface unit 1316 also may be utilized to connect to other types of networks and remote computer systems. The architecture 1300 also may include an input/output controller 1318 for receiving and processing input from several other devices, including a keyboard, mouse, touchpad, touchscreen, control devices such as buttons and switches or electronic stylus (not shown in FIG. 13). Similarly, the input/output controller 1318 may provide output to a display screen, user interface, a printer, or other type of output device (also not shown in FIG. 13).

It may be appreciated that the software components described herein may, when loaded into the processor 1302 and executed, transform the processor 1302 and the overall architecture 1300 from a general-purpose computing system into a special-purpose computing system customized to facilitate the functionality presented herein. The processor 1302 may be constructed from any number of transistors or other discrete circuit elements, which may individually or collectively assume any number of states. More specifically, the processor 1302 may operate as a finite-state machine, in response to executable instructions contained within the software modules disclosed herein. These computer-executable instructions may transform the processor 1302 by specifying how the processor 1302 transitions between states, thereby transforming the transistors or other discrete hardware elements constituting the processor 1302.

Encoding the software modules presented herein also may transform the physical structure of the computer-readable storage media presented herein. The specific transformation of physical structure may depend on various factors, in different implementations of this description. Examples of such factors may include, but are not limited to, the technology used to implement the computer-readable storage media, whether the computer-readable storage media is characterized as primary or secondary storage, and the like. For example, if the computer-readable storage media is implemented as semiconductor-based memory, the software disclosed herein may be encoded on the computer-readable storage media by transforming the physical state of the semiconductor memory. For example, the software may transform the state of transistors, capacitors, or other discrete circuit elements constituting the semiconductor memory. The software also may transform the physical state of such components in order to store data thereupon.

As another example, the computer-readable storage media disclosed herein may be implemented using magnetic or optical technology. In such implementations, the software presented herein may transform the physical state of magnetic or optical media, when the software is encoded therein. These transformations may include altering the magnetic characteristics of particular locations within given magnetic media. These transformations also may include altering the physical features or characteristics of particular locations within given optical media to change the optical characteristics of those locations. Other transformations of physical media are possible without departing from the scope and spirit of the present description, with the foregoing examples provided only to facilitate this discussion.

In light of the above, it may be appreciated that many types of physical transformations take place in the architecture 1300 in order to store and execute the software components presented herein. It also may be appreciated that the architecture 1300 may include other types of computing devices, including wearable devices, handheld computers, embedded computer systems, smartphones, PDAs, and other types of computing devices known to those skilled in the art. It is also contemplated that the architecture 1300 may not include all of the components shown in FIG. 13, may include other components that are not explicitly shown in FIG. 13, or may utilize an architecture completely different from that shown in FIG. 13.

FIG. 14 is a simplified block diagram of an illustrative computer system 1400 such as a PC, client machine, or server with which the present automated translation of clinical trial documents may be implemented. Computer system 1400 includes a processor 1405, a system memory 1411, and a system bus 1414 that couples various system components including the system memory 1411 to the processor 1405. The system bus 1414 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, or a local bus using any of a variety of bus architectures. The system memory 1411 includes read only memory (ROM) 1417 and random access memory (RAM) 1421. A basic input/output system (BIOS) 1425, containing the basic routines that help to transfer information between elements within the computer system 1400, such as during startup, is stored in ROM 1417. The computer system 1400 may further include a hard disk drive 1428 for reading from and writing to an internally disposed hard disk (not shown), a magnetic disk drive 1430 for reading from or writing to a removable magnetic disk 1433 (e.g., a floppy disk), and an optical disk drive 1438 for reading from or writing to a removable optical disk 1443 such as a CD (compact disc), DVD (digital versatile disc), or other optical media. The hard disk drive 1428, magnetic disk drive 1430, and optical disk drive 1438 are connected to the system bus 1414 by a hard disk drive interface 1446, a magnetic disk drive interface 1449, and an optical drive interface 1452, respectively. The drives and their associated computer-readable storage media provide non-volatile storage of computer-readable instructions, data structures, program modules, and other data for the computer system 1400. Although this illustrative example includes a hard disk, a removable magnetic disk 1433, and a removable optical disk 1443, other types of computer-readable storage media which can store data that is accessible by a computer such as magnetic cassettes, Flash memory cards, digital video disks, data cartridges, random access memories (RAMs), read only memories (ROMs), and the like may also be used in some applications of the present automated translation of clinical trial documents. In addition, as used herein, the term computer-readable storage media includes one or more instances of a media type (e.g., one or more magnetic disks, one or more CDs, etc.). For purposes of this specification and the claims, the phrase “computer-readable storage media” and variations thereof are intended to cover non-transitory embodiments, and do not include waves, signals, and/or other transitory and/or intangible communication media.

A number of program modules may be stored on the hard disk, magnetic disk 1433, optical disk 1443, ROM 1417, or RAM 1421, including an operating system 1455, one or more application programs 1457, other program modules 1460, and program data 1463. A user may enter commands and information into the computer system 1400 through input devices such as a keyboard 1466 and pointing device 1468 such as a mouse. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, trackball, touchpad, touchscreen, touch-sensitive device, voice-command module or device, user motion or user gesture capture device, or the like. These and other input devices are often connected to the processor 1405 through a serial port interface 1471 that is coupled to the system bus 1414, but may be connected by other interfaces, such as a parallel port, game port, or universal serial bus (USB). A monitor 1473 or other type of display device is also connected to the system bus 1414 via an interface, such as a video adapter 1475. In addition to the monitor 1473, personal computers typically include other peripheral output devices (not shown), such as speakers and printers. The illustrative example shown in FIG. 14 also includes a host adapter 1478, a Small Computer System Interface (SCSI) bus 1483, and an external storage device 1476 connected to the SCSI bus 1483.

The computer system 1400 is operable in a networked environment using logical connections to one or more remote computers, such as a remote computer 1488. The remote computer 1488 may be selected as another personal computer, a server, a router, a network PC, a peer device, or other common network node, and typically includes many or all of the elements described above relative to the computer system 1400, although only a single representative remote memory/storage device 1490 is shown in FIG. 14. The logical connections depicted in FIG. 14 include a local area network (LAN) 1493 and a wide area network (WAN) 1495. Such networking environments are often deployed, for example, in offices, enterprise-wide computer networks, intranets, and the Internet.

When used in a LAN networking environment, the computer system 1400 is connected to the local area network 1493 through a network interface or adapter 1496. When used in a WAN networking environment, the computer system 1400 typically includes a broadband modem 1498, network gateway, or other means for establishing communications over the wide area network 1495, such as the Internet. The broadband modem 1498, which may be internal or external, is connected to the system bus 1414 via a serial port interface 1471. In a networked environment, program modules related to the computer system 1400, or portions thereof, may be stored in the remote memory storage device 1490. It is noted that the network connections shown in FIG. 14 are illustrative and other means of establishing a communications link between the computers may be used depending on the specific requirements of an application of the present automated translation of clinical trial documents.

The subject matter described above is provided by way of illustration only and is not to be construed as limiting. Various modifications and changes may be made to the subject matter described herein without following the example embodiments and applications illustrated and described, and without departing from the true spirit and scope of the present invention, which is set forth in the following claims.

What is claimed:
 1. A computer-implemented method comprising: identifying a plurality of segments in a received clinical trial document; for each of the identified segments, matching the identified segment to a portion of a translated clinical trial document stored in a database to translate the identified segment from a first language to a second language; producing a representation for each set of sentences of the received clinical trial document that share a common meaning by applying at least one transformational technique to each of the identified segments of the received clinical trial document; and providing an output of the clinical trial document in the second language by reconstructing the clinical trial document from the translated identified segments and the representations for each set of sentences that share a common meaning.
 2. The method of claim 1, further comprising: applying the at least one transformational technique to one or more of the identified segments of the clinical trial document by using sentence splitting to identify tokens that determine key sentence structures.
 3. The method of claim 1, further comprising: applying the at least one transformational technique to one or more of the identified segments of the clinical trial document by re-arranging a set of words within the clinical trial document to enable the set of words to share a common meaning.
 4. The method of claim 1, wherein one or more of the identified segments of the clinical trial document share a common meaning in the second language.
 5. The method of claim 1, wherein the database comprises a plurality of translated clinical trial documents translated from the first language to the second language.
 6. The method of claim 1, further comprising: matching one or more of the identified segments of the received clinical trial document to one or more sources of additional information within the database.
 7. The method of claim 1, wherein the step of matching the identified segment to the portion of the translated clinical trial document comprises applying a plurality of matching patterns to the identified segment of the clinical trial document.
 8. A computer program product comprising a non-transitory storage medium having processor-readable instructions stored thereon that, when executed by one or more processors, cause the computer program product to: identify a plurality of segments in a received clinical trial document; for each of the identified segments, match the identified segment to a portion of a translated clinical trial document stored in a database to translate the identified segment from a first language to a second language; produce a representation for each set of sentences of the received clinical trial document that share a common meaning by applying at least one transformational technique to each of the identified segments of the received clinical trial document; and provide an output of the clinical trial document in the second language by reconstructing the clinical trial document from the translated identified segments and the representations for each set of sentences that share a common meaning.
 9. The computer program product of claim 8, wherein the at least one transformational technique is applied to one or more of the identified segments of the clinical trial document by using sentence splitting to identify tokens that determine key sentence structures.
 10. The computer program product of claim 8, wherein the at least one transformational technique is applied to one or more of the identified segments of the clinical trial document by re-arranging a set of words within the clinical trial document to enable the set of words to share a common meaning.
 11. The computer program product of claim 8, wherein one or more of the identified segments share a common meaning in the second language.
 12. The computer program product of claim 8, wherein the database comprises a plurality of translated clinical trial documents translated from the first language to the second language.
 13. The computer program product of claim 8, wherein one or more of the identified segments of the received clinical trial document are matched to one or more sources of additional information within the database.
 14. The computer program product of claim 8, wherein the step of matching the identified segment to the portion of the translated clinical trial document comprises applying a plurality of matching patterns to the identified segment of the clinical trial document.
 15. A computer system connected to a network, the system comprising: a memory configured to store instructions; one or more processors configured to execute the instructions and configured to: identify a plurality of segments in a received clinical trial document; for each of the identified segments, match the identified segment to a portion of a translated clinical trial document stored in a database to translate the identified segment from a first language to a second language; produce a representation for each set of sentences of the received clinical trial document that share a common meaning by applying at least one transformational technique to each of the identified segments of the received clinical trial document; and provide an output of the clinical trial document in the second language by reconstructing the clinical trial document from the translated identified segments and the representations for each set of sentences that share a common meaning.
 16. The system of claim 15, wherein the at least one transformational technique is applied to one or more of the identified segments of the clinical trial document by using sentence splitting to identify tokens that determine key sentence structures.
 17. The system of claim 15, wherein the at least one transformational technique is applied to one or more of the identified segments of the clinical trial document by re-arranging a set of words within the clinical trial document to enable the set of words to share a common meaning.
 18. The system of claim 15, wherein one or more of the identified segments of the clinical trial document share a common meaning in the second language.
 19. The system of claim 15, wherein the database comprises a plurality of translated clinical trial documents translated from the first language to the second language.
 20. The system of claim 15, wherein one or more of the identified segments are matched to one or more sources of additional information within the database.